End-to-end document table extraction and summarization pipeline leveraging computer vision and natural language processing techniques¶

Below is a brief outline of the steps executed in the workflow:

  1. Read the document (PDF) from a given location.
  2. Split the document into pages, convert each page into PNG/JPG format, and save it with an appropriate naming convention.
  3. Use a trained and deployed YOLO model to mark table boundaries.
  4. Extract sub-images by cropping the detected tables out of the original page images.
  5. Perform OCR on the tables and extract the text matrix.
  6. Send the text to OpenAI and generate a summary of each table.
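The six steps above can be sketched as a single driver function. This is an illustration only: the four helper names are hypothetical stand-ins for the code developed in the sections below, passed in as callables so each stage can be swapped out.

```python
# High-level sketch of the pipeline; the four helpers are hypothetical
# placeholders for the code in the sections that follow.
def run_pipeline(pdf_path, split_pdf_to_images, detect_tables, ocr_table, summarize):
    pages = split_pdf_to_images(pdf_path)                       # step 2
    crops = [c for page in pages for c in detect_tables(page)]  # steps 3-4
    return [summarize(ocr_table(c)) for c in crops]             # steps 5-6
```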

In [3]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [4]:
%cd /content/drive/MyDrive/yolov8
/content/drive/MyDrive/yolov8

1. Read the document (PDF) from a given location¶

It is easy to upload a document to Google Drive, but we may also want to read a document that is not on Google Drive without uploading it manually each time. The options in Google Colab are:

  • Upload directly to Google Drive and note the path where the document is stored: go to the folder and upload the document, e.g. "/content/drive/MyDrive/yolov7/Pdf_document/AutomateDocument.pdf"
  • If the document is not on Google Drive, we don't want to upload it manually each time, and it is available at an online URL, download it directly (see below)
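When the PDF lives at a URL, the standard library is enough to fetch it once into Drive. A minimal sketch, assuming a hypothetical URL; the actual download call is left commented out so the cell is safe to re-run offline.

```python
import os
import urllib.request

def pdf_filename_from_url(url: str) -> str:
    """Derive a local filename from the last path segment of the URL."""
    return url.rstrip("/").rsplit("/", 1)[-1]

pdf_url = "https://example.com/reports/AutomateDocument.pdf"  # hypothetical URL
dest_dir = "/content/drive/MyDrive/yolov7/Pdf_document"
local_path = os.path.join(dest_dir, pdf_filename_from_url(pdf_url))
# urllib.request.urlretrieve(pdf_url, local_path)  # uncomment to download
```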
In [52]:
import os
os.mkdir("/content/drive/MyDrive/yolov8/Table-Extraction-PDF-2/Annotated")

2. Split the document into pages, convert each page into PNG/JPG format, and save them with an appropriate naming convention¶

  • For this we use the pdf2image package together with Poppler
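A minimal sketch of the page-splitting step with pdf2image. The `dpi` value is an assumption, and Poppler must be installed (e.g. `apt-get install poppler-utils` on Colab); the conversion is guarded so the cell only runs where the PDF actually exists.

```python
import os

def page_image_name(page_number: int) -> str:
    # Naming convention used throughout this notebook: output_page_1.png, ...
    return f"output_page_{page_number}.png"

pdf_path = "/content/drive/MyDrive/yolov7/Pdf_document/AutomateDocument.pdf"
out_dir = "/content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument"

if os.path.exists(pdf_path):  # skip gracefully when the PDF is absent
    from pdf2image import convert_from_path  # requires Poppler
    os.makedirs(out_dir, exist_ok=True)
    for i, page in enumerate(convert_from_path(pdf_path, dpi=200), start=1):
        page.save(os.path.join(out_dir, page_image_name(i)), "PNG")
```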

Training YOLOv8 on a Custom Dataset (Identifying Tables)¶

This notebook is based on the ultralytics package and trains the model on custom objects; here I use a document dataset and tag tables in the scanned/screenshot images.

Steps that are followed¶

To train YOLOv8, follow these steps:

  • Install YOLOv8 and all its dependencies
  • Load a custom dataset (here, the document-parts dataset)
  • Run YOLOv8 training
  • Evaluate YOLOv8 performance
  • Run YOLOv8 inference on test images

Preparing a Custom Dataset¶

Annotating the dataset: As an important first step, we need to annotate the dataset. YOLOv8 requires, for each image, an annotation text file with the same name as the image, e.g. images_1.jpg and images_1.txt. Here images_1.txt contains the coordinates of the bounding boxes defining the objects to be tagged. Tools such as LabelImg create a text file with the same name as the image, but we still need to manually draw one or more bounding boxes to tag the different parts the image contains.
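To make the label format concrete, here is a sketch that parses one YOLO-format line (class id, then normalized center-x, center-y, width, height) back into pixel coordinates. The sample values echo a label line from this notebook's own output; the image size is an arbitrary assumption.

```python
# Sample label line taken from this notebook's output for output_page_4.png
label_line = "1 0.498138 0.767796 0.64572 0.087686"

def yolo_to_pixels(line: str, img_w: int, img_h: int):
    """Convert one YOLO label line to (class_id, (x1, y1, x2, y2)) in pixels."""
    cls_id, cx, cy, bw, bh = (float(v) for v in line.split())
    x1, y1 = int((cx - bw / 2) * img_w), int((cy - bh / 2) * img_h)
    x2, y2 = int((cx + bw / 2) * img_w), int((cy + bh / 2) * img_h)
    return int(cls_id), (x1, y1, x2, y2)

cls_id, box = yolo_to_pixels(label_line, img_w=1000, img_h=1000)
```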

Split the dataset into train, test, and validation folders.
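The split itself takes only a few lines of standard-library Python. A sketch, assuming a 70/20/10 split; the exact ratios are a choice, not something this notebook prescribes.

```python
import random

def split_dataset(filenames, train=0.7, valid=0.2, seed=0):
    """Shuffle and split image filenames into train/valid/test lists."""
    names = sorted(filenames)
    random.Random(seed).shuffle(names)  # deterministic shuffle for reproducibility
    n_train = int(len(names) * train)
    n_valid = int(len(names) * valid)
    return names[:n_train], names[n_train:n_train + n_valid], names[n_train + n_valid:]
```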

The New YOLOv8 API¶

The developers of YOLOv8 decided to break away from the standard YOLO project design: separate train.py, detect.py, val.py, and export.py scripts. In the short term this will probably cause some confusion, but in the long term it is a fantastic decision!

This pattern has been around since YOLOv3, and every YOLO iteration has replicated it. It was relatively simple to understand but notoriously challenging to deploy, especially in real-time processing and tracking scenarios.

The new approach is much more flexible because it allows YOLOv8 to be used independently through the terminal, as well as being part of a complex computer vision application.
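For the "part of an application" case, the training run launched below via the CLI can also be expressed through the Python API. A sketch, wrapped in a function and guarded by the import so it only executes where ultralytics is installed; the arguments mirror the training configuration shown later in this notebook.

```python
# Training arguments mirroring the CLI run below (yolov8m.pt, 20 epochs, 640px)
train_args = dict(
    data="/content/drive/MyDrive/yolov8/Table-Extraction-PDF-2/data.yaml",
    epochs=20,
    imgsz=640,
)

def train_tables_model(weights: str = "yolov8m.pt"):
    """Launch the same training run as the CLI, via the ultralytics Python API."""
    from ultralytics import YOLO  # requires the ultralytics package
    model = YOLO(weights)         # load a pretrained checkpoint
    return model.train(**train_args)
```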

Train YOLOv8 on a custom dataset¶

In [ ]:
 
Ultralytics YOLOv8.0.196 🚀 Python-3.10.12 torch-2.0.1+cu118 CUDA:0 (Tesla T4, 15102MiB)
engine/trainer: task=detect, mode=train, model=yolov8m.pt, data=/content/drive/MyDrive/yolov8/Table-Extraction-PDF-2/data.yaml, epochs=20, patience=50, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=None, workers=8, project=None, name=None, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, vid_stride=1, stream_buffer=False, line_width=None, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, boxes=True, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, cfg=None, tracker=botsort.yaml, save_dir=runs/detect/train4
Downloading https://ultralytics.com/assets/Arial.ttf to '/root/.config/Ultralytics/Arial.ttf'...
100% 755k/755k [00:00<00:00, 10.9MB/s]
Overriding model.yaml nc=80 with nc=2

                   from  n    params  module                                       arguments                     
  0                  -1  1      1392  ultralytics.nn.modules.conv.Conv             [3, 48, 3, 2]                 
  1                  -1  1     41664  ultralytics.nn.modules.conv.Conv             [48, 96, 3, 2]                
  2                  -1  2    111360  ultralytics.nn.modules.block.C2f             [96, 96, 2, True]             
  3                  -1  1    166272  ultralytics.nn.modules.conv.Conv             [96, 192, 3, 2]               
  4                  -1  4    813312  ultralytics.nn.modules.block.C2f             [192, 192, 4, True]           
  5                  -1  1    664320  ultralytics.nn.modules.conv.Conv             [192, 384, 3, 2]              
  6                  -1  4   3248640  ultralytics.nn.modules.block.C2f             [384, 384, 4, True]           
  7                  -1  1   1991808  ultralytics.nn.modules.conv.Conv             [384, 576, 3, 2]              
  8                  -1  2   3985920  ultralytics.nn.modules.block.C2f             [576, 576, 2, True]           
  9                  -1  1    831168  ultralytics.nn.modules.block.SPPF            [576, 576, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 12                  -1  2   1993728  ultralytics.nn.modules.block.C2f             [960, 384, 2]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  2    517632  ultralytics.nn.modules.block.C2f             [576, 192, 2]                 
 16                  -1  1    332160  ultralytics.nn.modules.conv.Conv             [192, 192, 3, 2]              
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  2   1846272  ultralytics.nn.modules.block.C2f             [576, 384, 2]                 
 19                  -1  1   1327872  ultralytics.nn.modules.conv.Conv             [384, 384, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 21                  -1  2   4207104  ultralytics.nn.modules.block.C2f             [960, 576, 2]                 
 22        [15, 18, 21]  1   3776854  ultralytics.nn.modules.head.Detect           [2, [192, 384, 576]]          
Model summary: 295 layers, 25857478 parameters, 25857462 gradients, 79.1 GFLOPs

Transferred 469/475 items from pretrained weights
TensorBoard: Start with 'tensorboard --logdir runs/detect/train4', view at http://localhost:6006/
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
Downloading https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8n.pt to 'yolov8n.pt'...
100% 6.23M/6.23M [00:00<00:00, 107MB/s]
AMP: checks passed ✅
train: Scanning /content/drive/MyDrive/yolov8/Table-Extraction-PDF-2/train/labels... 238 images, 0 backgrounds, 0 corrupt: 100% 238/238 [00:00<00:00, 244.04it/s]
train: New cache created: /content/drive/MyDrive/yolov8/Table-Extraction-PDF-2/train/labels.cache
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
val: Scanning /content/drive/MyDrive/yolov8/Table-Extraction-PDF-2/valid/labels... 70 images, 0 backgrounds, 0 corrupt: 100% 70/70 [00:00<00:00, 213.62it/s]
val: New cache created: /content/drive/MyDrive/yolov8/Table-Extraction-PDF-2/valid/labels.cache
Plotting labels to runs/detect/train4/labels.jpg... 
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... 
optimizer: AdamW(lr=0.001667, momentum=0.9) with parameter groups 77 weight(decay=0.0), 84 weight(decay=0.0005), 83 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 2 dataloader workers
Logging results to runs/detect/train4
Starting training for 20 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       1/20      6.88G      1.357      3.079      1.502         43        640: 100% 15/15 [00:10<00:00,  1.39it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:03<00:00,  1.32s/it]
                   all         70        109      0.318      0.561      0.373      0.295

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       2/20      7.09G     0.7192      1.578      1.094         42        640: 100% 15/15 [00:07<00:00,  1.90it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:04<00:00,  1.37s/it]
                   all         70        109      0.536       0.65      0.566      0.387

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       3/20      7.12G     0.7845      1.416      1.071         53        640: 100% 15/15 [00:09<00:00,  1.64it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:01<00:00,  2.28it/s]
                   all         70        109      0.234      0.643      0.186      0.107

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       4/20      7.09G     0.8411      1.371      1.113         47        640: 100% 15/15 [00:07<00:00,  2.00it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:02<00:00,  1.21it/s]
                   all         70        109      0.173      0.736      0.162     0.0855

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       5/20      7.11G     0.7543      1.192      1.067         55        640: 100% 15/15 [00:07<00:00,  2.07it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:01<00:00,  2.21it/s]
                   all         70        109    0.00882      0.293    0.00804    0.00437

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       6/20      7.09G     0.7599      1.094      1.072         49        640: 100% 15/15 [00:07<00:00,  2.14it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:01<00:00,  2.66it/s]
                   all         70        109     0.0318      0.373     0.0278    0.00899

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       7/20      7.13G     0.7945      1.102      1.071         61        640: 100% 15/15 [00:07<00:00,  1.97it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:01<00:00,  1.95it/s]
                   all         70        109      0.189      0.572      0.256      0.139

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       8/20      7.16G     0.7349      1.061      1.043         50        640: 100% 15/15 [00:08<00:00,  1.74it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:01<00:00,  1.96it/s]
                   all         70        109      0.917      0.217      0.468      0.322

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       9/20      7.08G     0.7576      1.022      1.044         61        640: 100% 15/15 [00:07<00:00,  2.02it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:02<00:00,  1.32it/s]
                   all         70        109      0.534      0.635      0.592      0.443

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      10/20      7.11G     0.7777      1.091      1.051         42        640: 100% 15/15 [00:07<00:00,  2.05it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:02<00:00,  1.42it/s]
                   all         70        109      0.533      0.563      0.617      0.418
Closing dataloader mosaic
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      11/20      7.12G     0.6952     0.9462      1.042         20        640: 100% 15/15 [00:10<00:00,  1.48it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:02<00:00,  1.22it/s]
                   all         70        109      0.592      0.733      0.677      0.503

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      12/20      7.12G     0.6666     0.9255      1.031         31        640: 100% 15/15 [00:07<00:00,  2.07it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:01<00:00,  1.75it/s]
                   all         70        109      0.651      0.774      0.694       0.56

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      13/20      7.09G     0.6255     0.9046     0.9958         26        640: 100% 15/15 [00:07<00:00,  1.98it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:02<00:00,  1.21it/s]
                   all         70        109      0.678      0.752      0.666      0.544

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      14/20      7.11G     0.6584     0.9611      1.012         20        640: 100% 15/15 [00:07<00:00,  1.98it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:01<00:00,  2.48it/s]
                   all         70        109       0.67      0.858      0.794      0.599

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      15/20      7.11G     0.6331     0.8373     0.9879         26        640: 100% 15/15 [00:07<00:00,  1.93it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:01<00:00,  2.83it/s]
                   all         70        109      0.704       0.63      0.558      0.466

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      16/20      7.12G     0.5783     0.7323     0.9896         22        640: 100% 15/15 [00:07<00:00,  1.91it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:01<00:00,  1.72it/s]
                   all         70        109      0.944      0.836      0.911      0.781

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      17/20      7.08G     0.5305     0.6985     0.9461         29        640: 100% 15/15 [00:07<00:00,  2.03it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:02<00:00,  1.31it/s]
                   all         70        109       0.89      0.862      0.908      0.821

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      18/20      7.11G     0.4854     0.7519     0.9068         17        640: 100% 15/15 [00:08<00:00,  1.70it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:01<00:00,  2.08it/s]
                   all         70        109      0.855       0.81      0.903      0.826

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      19/20      7.13G     0.4556     0.5932     0.9166         16        640: 100% 15/15 [00:08<00:00,  1.77it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:01<00:00,  1.71it/s]
                   all         70        109      0.935      0.854      0.936      0.873

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      20/20      7.11G     0.4525     0.6147     0.8903         19        640: 100% 15/15 [00:08<00:00,  1.86it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:01<00:00,  2.03it/s]
                   all         70        109      0.881      0.875      0.922      0.859

20 epochs completed in 0.083 hours.
Optimizer stripped from runs/detect/train4/weights/last.pt, 52.0MB
Optimizer stripped from runs/detect/train4/weights/best.pt, 52.0MB

Validating runs/detect/train4/weights/best.pt...
Ultralytics YOLOv8.0.196 🚀 Python-3.10.12 torch-2.0.1+cu118 CUDA:0 (Tesla T4, 15102MiB)
Model summary (fused): 218 layers, 25840918 parameters, 0 gradients, 78.7 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 3/3 [00:02<00:00,  1.01it/s]
                   all         70        109      0.935      0.854      0.936      0.874
              bordered         70         23      0.965      0.826      0.933      0.901
            borderless         70         86      0.905      0.882      0.938      0.848
Speed: 6.3ms preprocess, 10.1ms inference, 0.0ms loss, 3.6ms postprocess per image
Results saved to runs/detect/train4
💡 Learn more at https://docs.ultralytics.com/modes/train

Validate the new model¶

When training is over, we can validate the new model on images it has not seen before. This is why, when creating the dataset, we divided it into three parts, one of which we now use as a test dataset.

In [22]:
!yolo task=detect \
mode=val \
model={model_dir} \
data={train_data}/data.yaml
Ultralytics YOLOv8.0.196 🚀 Python-3.10.12 torch-2.0.1+cu118 CPU (Intel Xeon 2.20GHz)
Model summary (fused): 168 layers, 3006038 parameters, 0 gradients, 8.1 GFLOPs
val: Scanning /content/drive/MyDrive/yolov8/Table-Extraction-PDF-2/valid/labels.cache... 70 images, 0 backgrounds, 0 corrupt: 100% 70/70 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 5/5 [00:29<00:00,  5.81s/it]
                   all         70        109      0.943      0.905      0.966      0.913
              bordered         70         23      0.991      0.913      0.981      0.971
            borderless         70         86      0.895      0.896       0.95      0.854
Speed: 6.5ms preprocess, 215.6ms inference, 0.0ms loss, 1.3ms postprocess per image
Results saved to runs/detect/val5
💡 Learn more at https://docs.ultralytics.com/modes/val
In [ ]:
!ls /content/drive/MyDrive/yolov7/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_5.png
detect	     detect3  output_page_1.png  output_page_4.png  output_page_7.png
detect2      detect4  output_page_2.png  output_page_5.png  output_page_8.png
detect2_new  detect5  output_page_3.png  output_page_6.png
In [25]:
!yolo task=detect \
mode=predict \
model={model_dir} \
conf=0.25 \
source={output_dir_} \
name={inf_dir} \
save_txt=True
Ultralytics YOLOv8.0.196 🚀 Python-3.10.12 torch-2.0.1+cu118 CPU (Intel Xeon 2.20GHz)
Model summary (fused): 168 layers, 3006038 parameters, 0 gradients, 8.1 GFLOPs

image 1/8 /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_1.png: 640x512 (no detections), 197.6ms
image 2/8 /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_2.png: 640x512 (no detections), 161.1ms
image 3/8 /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_3.png: 640x512 (no detections), 153.7ms
image 4/8 /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_4.png: 640x512 1 borderless, 158.4ms
image 5/8 /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_5.png: 640x512 1 borderless, 156.1ms
image 6/8 /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_6.png: 640x512 2 borderlesss, 157.2ms
image 7/8 /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_7.png: 640x512 1 borderless, 148.4ms
image 8/8 /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/output_page_8.png: 640x512 (no detections), 157.6ms
Speed: 4.5ms preprocess, 161.3ms inference, 1.5ms postprocess per image at shape (1, 3, 640, 512)
Results saved to /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/detect
4 labels saved to /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/detect/labels
💡 Learn more at https://docs.ultralytics.com/modes/predict

4. Extract sub-images by cropping the detected tables out of the original images¶

In [47]:
import cv2
import os

# Directories for the page images and the YOLO label files from the predict step
image_directory = output_dir
text_directory = inf_dir + "labels/"
output_directory = inf_dir + "TagTable/"
margin = 15  # extra pixels kept around each detected table

# Create the output directory if it doesn't exist
os.makedirs(output_directory, exist_ok=True)

# For each label file, crop the detected table regions out of the page image
for label_filename in os.listdir(text_directory):
    image_filename = os.path.splitext(label_filename)[0] + ".png"
    image = cv2.imread(os.path.join(image_directory, image_filename))
    if image is None:
        continue
    h, w = image.shape[:2]
    with open(os.path.join(text_directory, label_filename)) as f:
        for i, line in enumerate(f, start=1):
            # YOLO format: class x_center y_center width height (normalized)
            _, cx, cy, bw, bh = map(float, line.split())
            x1 = max(int((cx - bw / 2) * w) - margin, 0)
            y1 = max(int((cy - bh / 2) * h) - margin, 0)
            x2 = min(int((cx + bw / 2) * w) + margin, w)
            y2 = min(int((cy + bh / 2) * h) + margin, h)
            crop_name = os.path.splitext(image_filename)[0] + f"_table_{i}.jpg"
            cv2.imwrite(os.path.join(output_directory, crop_name), image[y1:y2, x1:x2])

# Optionally, you can add error handling, resizing, or other processing as needed

5. Perform OCR on tables and extract the text matrix¶

In [50]:
import os
import cv2
import pytesseract
from PIL import Image

# Directory with the cropped table images
image_directory = output_directory
# Output directory for saving OCR results
outputocr_directory = inf_dir + "OCRTable/"

# Create the output directory if it doesn't exist
os.makedirs(outputocr_directory, exist_ok=True)

# Loop through the image files in the directory
for image_filename in os.listdir(image_directory):
    if image_filename.endswith(".jpg"):  # Adjust the file extension as needed
        image_path = os.path.join(image_directory, image_filename)
        image = cv2.imread(image_path)

        # Convert the image to grayscale for better OCR accuracy
        gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

        # Use pytesseract to perform OCR on the grayscale image
        ocr_result = pytesseract.image_to_string(gray_image)

        # Generate a filename for the OCR result text file
        output_filename = os.path.splitext(image_filename)[0] + "_ocr.txt"
        output_path = os.path.join(outputocr_directory, output_filename)

        # Save the OCR result to the text file
        with open(output_path, "w") as file:
            file.write(ocr_result)

print("OCR results saved in the directory:", outputocr_directory)
OCR results saved in the directory: /content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/detect/OCRTable/

EXTRA CODE¶

Draw annotations for the train and test data, or for any other set of images for which you have both the image and its label text file¶

In [53]:
import cv2
import os
from IPython.display import Image, display

# Mount Google Drive to access your data (optional, if your data is in Google Drive)
from google.colab import drive
drive.mount('/content/drive')
#os.rmdir("/content/drive/MyDrive/yolov7/document-parts-2/Annotated")
#os.mkdir("/content/drive/MyDrive/yolov7/document-parts-2/Annotated")
# Directory containing images and label files
anno_dir = train_data + "/Annotated"
data_dir = output_dir_
data_dir_labels =  inf_dir + "labels"
file_list = os.listdir(data_dir)
print(file_list)


# Counter to limit the number of images displayed
display_count = 0


print("total number of images in this directory is :", len(file_list))


# Loop through each image in the directory
for image_file in file_list:
    if image_file.endswith(".png"):  # Adjust the file extension as needed
        # Load the image
        image_path = os.path.join(data_dir, image_file)
        image = cv2.imread(image_path)
        h, w = image.shape[:2]

        # Matching YOLO label file produced by the predict step
        label_path = os.path.join(data_dir_labels, os.path.splitext(image_file)[0] + ".txt")
        if not os.path.exists(label_path):
            continue

        # Draw every bounding box from the label file onto the image
        with open(label_path) as f:
            for line in f:
                print(image_file, line)
                cls_id, cx, cy, bw, bh = map(float, line.split())
                x1, y1 = int((cx - bw / 2) * w), int((cy - bh / 2) * h)
                x2, y2 = int((cx + bw / 2) * w), int((cy + bh / 2) * h)
                cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)

        output_image_file = os.path.join(anno_dir, "annotated_" + image_file)
        cv2.imwrite(output_image_file, image)

        # Display the annotated image using IPython
        print("Name is: ", display_count, image_file)
        display(Image(output_image_file))

        # Increment the counter
        display_count += 1

        # Check if we've displayed 100 images
        #if display_count >= 100:
        #    break  # Stop displaying images

# Print a message when the task is completed
print("Annotation and display completed.")
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
['output_page_1.png', 'output_page_2.png', 'output_page_3.png', 'output_page_4.png', 'output_page_5.png', 'output_page_6.png', 'output_page_7.png', 'output_page_8.png', 'detect']
total number of images in this directory is : 9
output_page_4.png 1 0.498138 0.767796 0.64572 0.087686

Name is:  0 output_page_4.png
output_page_5.png 1 0.499508 0.240746 0.64542 0.184216

Name is:  1 output_page_5.png
output_page_6.png 1 0.499833 0.328019 0.64716 0.102913

output_page_6.png 1 0.506315 0.728177 0.662595 0.235299

Name is:  2 output_page_6.png
output_page_7.png 1 0.502882 0.357009 0.650774 0.101709

Name is:  3 output_page_7.png
Annotation and display completed.
In [42]:
## for one image only


import cv2
import pytesseract
from PIL import Image

# Load one cropped table image
image_path = "/content/drive/MyDrive/yolov8/Pdf_document/Pdf_To_Images/AutomateDocument/detect/TagTable/output_page_5_table_1.jpg"
# 'Pdf_To_Images/TagTable/output_page_12_table_1.jpg'
image = cv2.imread(image_path)

# Run OCR and build the text matrix: one row per non-empty line of text
ocr_result = pytesseract.image_to_string(Image.fromarray(image))
table_matrix = [[line] for line in ocr_result.splitlines() if line.strip()]

# Print the extracted matrix
for row in table_matrix:
    print(row)
['iable 2: summary of Keported and Matched Employment and rirm otructure']
['Percentiles']
['Variable Mean sD PS P25 P50 P75 P95']
['Reported Total Employment']
['Firm employment 307.4 753.7 1 7 57.5 179.5 | 1394.5']
['Matched Employment and Firm Structure (without using the Census Multi-unit data)']
['Employment 303.3 | 801.9 0.5 5 46.5 171 | 1324.5']
['Number of EINs 17 1.8 0.5 1 1 1 5.5']
['Number of 103 | 27.0 | 05 1 2 7 | 485']
['establishments']